OPTIMAL PROPERTIES OF CENTROID-BASED CLASSIFIERS 
FOR VERY HIGH-DIMENSIONAL DATA 

By Peter Hall and Tung Pham 

University of Melbourne 

We show that scale-adjusted versions of the centroid-based classifier
enjoy optimal properties when used to discriminate between two 
very high-dimensional populations where the principal differences are 
in location. The scale adjustment removes the tendency of scale differ- 
ences to confound differences in means. Certain other distance-based 
methods, for example, those founded on nearest-neighbor distance, 
do not have optimal performance in the sense that we propose. Our 
results permit varying degrees of sparsity and signal strength to be 
treated, and require only mild conditions on dependence of vector 
components. Additionally, we permit the marginal distributions of 
vector components to vary extensively. In addition to providing the- 
ory we explore numerical properties of a centroid-based classifier, and 
show that these features reflect theoretical accounts of performance. 

1. Introduction. 

1.1. Motivation and summary. Suppose we observe samples $\mathcal{X}$ and $\mathcal{Y}$, 
both consisting of p-vectors, drawn by sampling randomly from respective 
populations $\Pi_X$ and $\Pi_Y$. In this paper we establish optimality properties for 
classifiers based on the centroid method in cases where p is large and sample 
sizes are, generally, much smaller. For the applications we have in mind, 
sample sizes can be quite small indeed; for example, in genomic problems p 
is typically in the thousands or tens of thousands, but training sample sizes 
may be only in the teens, or even less. It is shown that in cases such as this, 
a scale-adjusted version of the classifier is able to discriminate in an optimal 
way between populations that differ in terms of location. Scale adjustment 
removes the tendency for scale to confound location differences when using 
distance-based classifiers, and permits the method to enjoy high levels of 
performance when location differences are relatively small. 

In order to outline our main results, let us suppose that location differ- 
ences are present in a proportion q of the p components; that both training 
sample sizes are at least as large as 2, and are of similar magnitude, $\nu$ say; 
and that the components of the data vectors are not too strongly correlated, 
in particular that the maximum of the sum of absolute values of covariances, 
against any particular component, is bounded. Then a good classifier can 
correctly distinguish between the populations that correspond to the training 
samples, provided that the size of the location differences is a sufficiently 
large constant multiple of $(\nu p q^2)^{-1/4}$. Moreover, in minimax terms this size 
of distance is the minimum possible for accurate discrimination. 

These results hold for large values of the dimension, p, and in particular 
they are valid in cases where dimension is of larger order than the training- 
sample sizes. However, the results can fail if sample sizes are of a larger order 
than p, for example, if p is held fixed while samples increase. Therefore, our 
results specifically address the case where dimension is high. 

In our lower-bound analysis we impose the condition that q exceeds a 
constant multiple of $(\nu/p)^{1/2}$, thereby preventing sparsity, indexed by q, from 
being too low. This assumption implies that $(\nu p q^2)^{-1/4}$ is bounded above 
by a constant multiple of $\nu^{-1/2}$, and entails boundedness of the location 
differences. However, our work does not require $\nu$, denoting the order of 
magnitude of training-sample size, to diverge; $\nu$ can be held fixed, although 
it can be chosen to diverge if desired. Therefore our results encompass cases 
where the location differences are bounded away from zero as p increases, 
as well as instances where the differences converge to zero. 

1.2. Interpretation. First we note that results of the type discussed above 
hold only in the very high-dimensional cases that are the subject of our 
work, and not in more conventional settings. To indicate why, let us simplify 
matters by taking q = 1. In this setting it is readily shown that if p is held 
fixed, but $\nu$ is permitted to increase, then simple distance-based classifiers 
can detect location differences that are of order $\nu^{-1/2}$ in size. However, fixing 
p and varying $\nu$ in the convergence-rate formula $(\nu p q^2)^{-1/4} = (\nu p)^{-1/4}$ would 
suggest, incorrectly, that the best rate is only $\nu^{-1/4}$. Therefore the formula 
is not applicable to cases where dimension is much smaller than sample size. 
More specifically, the fact that the critical quantity $(\nu p q^2)^{-1/4}$ involves the 
exponent $-\frac{1}{4}$, rather than $-\frac{1}{2}$ which arises in more conventional settings, 
underscores the challenge of undertaking classification using small samples 
of high-dimensional data, rather than large samples of low-dimensional data. 
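
The contrast between the two exponents can be made concrete with a small numerical sketch. The function names and the example values of p and the sample size are illustrative assumptions, not part of the paper; the code only evaluates the two orders of magnitude quoted above, ignoring constants.

```python
def high_dim_boundary(nu: float, p: float, q: float = 1.0) -> float:
    """Order of the critical location difference, (nu * p * q^2)^(-1/4)."""
    return (nu * p * q ** 2) ** -0.25


def fixed_dim_rate(nu: float) -> float:
    """Conventional order nu^(-1/2), for fixed p and growing sample size."""
    return nu ** -0.5


# Hypothetical setting: p = 10,000 components, no sparsity (q = 1).
P_EXAMPLE = 10_000
# Multiplying the sample size by 16 halves the high-dimensional boundary,
# whereas the conventional rate would shrink by a factor of 4.
RATIO_HIGH_DIM = high_dim_boundary(1, P_EXAMPLE) / high_dim_boundary(16, P_EXAMPLE)
RATIO_FIXED_DIM = fixed_dim_rate(1) / fixed_dim_rate(16)
```

Under these assumed values, a sixteen-fold increase in training-sample size buys only a two-fold improvement in the detectable location difference, which is the practical meaning of the exponent $-\frac{1}{4}$.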

Among classification problems that are relatively difficult to solve are 
those where the location differences that distinguish the two populations are 
so irregular as to resemble stochastic processes. In such cases, classifiers can 
readily confuse location differences with additive random noise. Therefore, 
when establishing lower bounds we interpret location differences as random 
variables that have the same distribution (after rescaling) as the noise. Our 
upper-bound results also permit this treatment. 

1.3. Comparison with other classifiers. Other classifiers, for example, those 
based on nonparametric function estimation or $k$-nearest-neighbor methods, 
are competitive under suitable conditions. Several classifiers can be 
interpreted, either explicitly or implicitly, as empirical approximations to 
the Bayes classifier. For example, Stone (1977) discusses empirical classifiers 
based on function approximations, and Cover (1968) and Devroye and Wagner 
(1982) address $k$-nearest-neighbor methods. Using the latter approach, 
and in low-dimensional settings, if k is chosen to diverge appropriately as 
sample size increases then the classifier can achieve the same first-order 
asymptotic performance as the Bayes method. This is achieved through the 
classifier implicitly estimating the unknown densities, $f_X$ and $f_Y$, say, of 
the two populations, and using them in a manner which is first-order equivalent 
to the Bayes rule, that is, assigning a new data value, Z, to $\Pi_X$ if 
$f_X(Z) > f_Y(Z)$ and assigning Z to $\Pi_Y$ otherwise. The empirical approaches 
suggested by Stone (1977) and Hall and Kang (2005) do this more explicitly. 
If p increases sufficiently slowly as the training sample sizes diverge then em- 
pirical classifiers such as these can strongly outperform the centroid-based 
method. 

However, both explicit and implicit estimation of $f_X$ and $f_Y$ are ineffective 
when the dimension is of the same order as, or of larger order than, the sample 
sizes. There, methods such as the centroid-based classifier and the support 
vector machine come into their own. Both these methods exhibit the optimal 
performance expressed by Theorems 1 and 2. In the case of the support 
vector machine we need somewhat more restrictive conditions than those 
that we impose in Section 3, and in particular conditions which require the training 
sample sizes to diverge no more quickly than $p^{1/10}$. A proof in that case is 
given in the unpublished Ph.D. thesis of the second author. 

1.4. Related work. The literature on statistical classification is particu- 
larly extensive, and we shall provide here only a brief pointer to relatively 
recent literature. Hastie, Tibshirani and Friedman (2001) give a benchmark 
survey of statistical learning, and Dudoit, Fridlyand and Speed (2002) pro- 
vide an authoritative comparison of the performance of statistical classi- 
fiers. Dabney (2005), Dabney and Storey (2005, 2007), Tibshirani et al. 
(2002) and Wang and Zhu (2007) discuss the application of centroid-based 
classifiers to genomic data. Many other contributions are written from the 
viewpoint of engineering, computer science and other fields, rather than 
statistics, and address applications in areas ranging from image analysis 
[e.g., Cootes et al. (1993)] and forestry [e.g., Franco-Lopez, Ek and Bauer 
(2001)] to speech recognition [e.g., Bilmes and Kirchhoff (2004)] and chemo- 
metrics [e.g., Schoonover, Marx and Zhang (2003)]. They include work on 
the development of transformation methods for improving classifier perfor- 
mance [e.g., Sinden and Wilfong (1992), Simard, Lecun and Denker (1993) 
and Wakahara, Kimura and Tomono (2001)]. Chan and Hall (2009) provide 
background to scale adjustment. Nearest-neighbor methods are discussed 
by Dasarathy (1990) and Shakhnarovich, Darrell and Indyk (2005). Van der 
Walt and Barnard (2006) give a recent account of classifier performance. 
Duda, Hart and Stork (2001) provide a book-length treatment of classifiers 
in the context of pattern recognition. 



2. Scale adjustment. 



2.1. Scale-adjusted centroid-based classifier. A standard centroid-based 
classifier can be defined as follows. Let $\mathcal{X} = \{X_1, \ldots, X_m\}$ and $\mathcal{Y} = \{Y_1, \ldots, Y_n\}$ 
denote random samples of p-vectors from populations $\Pi_X$ and $\Pi_Y$, respectively, 
and write $\bar{X} = m^{-1} \sum_i X_i$ and $\bar{Y} = n^{-1} \sum_j Y_j$ for the respective sample 
means. Put 

(2.1) $T(Z) = \|Z - \bar{Y}\|^2 - \|Z - \bar{X}\|^2$. 

Given a new data vector Z from one of the two populations, classify Z as 
coming from $\Pi_X$ if T(Z) > 0, and assign Z to $\Pi_Y$ if T(Z) < 0. 

This classifier is used frequently to distinguish between two populations on 
the basis of location differences. In that setting it enjoys good performance 
if the training sample sizes m and n are reasonably large, but in other cases 
its effectiveness can be hampered by excessive scale differences. A simple 
adjustment removes this difficulty. Specifically, define 

$\hat\tau_X^2 = \frac{1}{2m(m-1)} \sum_{i_1=1}^{m} \sum_{i_2=1}^{m} \sum_{k=1}^{p} (X_{i_1 k} - X_{i_2 k})^2$, 

$\hat\tau_Y^2 = \frac{1}{2n(n-1)} \sum_{i_1=1}^{n} \sum_{i_2=1}^{n} \sum_{k=1}^{p} (Y_{i_1 k} - Y_{i_2 k})^2$, 

denoting unbiased estimators of 

$\tau_X^2 = \sum_{k=1}^{p} E(X_{1k} - E X_{1k})^2$, $\tau_Y^2 = \sum_{k=1}^{p} E(Y_{1k} - E Y_{1k})^2$, 

respectively. The scale-adjusted form of T(Z), defined at (2.1), is 

(2.2) $T_{sa}(Z) = T(Z) + m^{-1} \hat\tau_X^2 - n^{-1} \hat\tau_Y^2$. 

Scale adjustments of other distance-based classifiers are also effective, but 
in general the adjustments differ from that given in (2.2). 

From some viewpoints the correction at (2.2) provides an adjustment of 
bias, rather than scale. However, if we were to refer to it as a bias adjustment 
then it might be interpreted as a means of diminishing the effects of differences 
between the locations of populations $\Pi_X$ and $\Pi_Y$. To the contrary, it 
removes the effects of scale in order that location differences might be made 
more pronounced, rather than diminished. 

The quantity $T_{sa}(Z)$ is an unbiased estimator of the signed sum of squares 
of distances among means, 

$E\{T_{sa}(Z) \mid Z\} = s(Z) \sum_{k=1}^{p} (E X_{1k} - E Y_{1k})^2$, 

where $s(Z) = 1$ if Z is from $\Pi_X$, and $s(Z) = -1$ if Z comes from $\Pi_Y$. Therefore, 
unlike T(Z), the expected value of which is given by 

$E\{T(Z) \mid Z\} = s(Z) \sum_{k=1}^{p} (E X_{1k} - E Y_{1k})^2 + n^{-1} \tau_Y^2 - m^{-1} \tau_X^2$ 

in the centroid method approach, $T_{sa}(Z)$ focuses sharply on component-wise 
differences among means. 

If it should happen that $m^{-1}\tau_X^2 = n^{-1}\tau_Y^2$, for example, if m = n and 
the populations have identical average scales, then scale adjustment is not 
necessary. In this context our results for the classifier based on $T_{sa}(Z)$, in 
particular result (3.4) in Section 3.2, hold also for the standard classifier 
based on T(Z). 
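
The definitions at (2.1) and (2.2) can be sketched in a few lines of code. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, data vectors are plain Python lists, and a tie $T_{sa}(Z) = 0$, which the text leaves unassigned, is arbitrarily sent to the Y population.

```python
from typing import List, Sequence

Vector = Sequence[float]


def centroid(sample: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a sample of p-vectors."""
    size = len(sample)
    return [sum(column) / size for column in zip(*sample)]


def sq_dist(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance between two p-vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def tau_sq_hat(sample: Sequence[Vector]) -> float:
    """Unbiased estimator of the summed component variances.

    Double sum of squared pairwise distances divided by 2 m (m - 1);
    this needs at least two observations.
    """
    m = len(sample)
    if m < 2:
        raise ValueError("need at least two observations to estimate scale")
    total = sum(sq_dist(x1, x2) for x1 in sample for x2 in sample)
    return total / (2 * m * (m - 1))


def t_stat(z: Vector, xs: Sequence[Vector], ys: Sequence[Vector]) -> float:
    """Standard centroid statistic T(Z) of (2.1)."""
    return sq_dist(z, centroid(ys)) - sq_dist(z, centroid(xs))


def t_sa(z: Vector, xs: Sequence[Vector], ys: Sequence[Vector]) -> float:
    """Scale-adjusted statistic T_sa(Z) of (2.2)."""
    return t_stat(z, xs, ys) + tau_sq_hat(xs) / len(xs) - tau_sq_hat(ys) / len(ys)


def classify(z: Vector, xs: Sequence[Vector], ys: Sequence[Vector]) -> str:
    """Return "X" if T_sa(Z) > 0, otherwise "Y" (ties go to "Y" by assumption)."""
    return "X" if t_sa(z, xs, ys) > 0 else "Y"
```

For instance, with training samples `xs = [[0, 0], [2, 0]]` and `ys = [[10, 10], [10, 12]]`, both scale estimates equal 2, the two adjustment terms cancel because m = n, and `classify([1, 1], xs, ys)` returns "X".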

2.2. Scale adjustment in other contexts. It can be seen from the defini- 
tion of a centroid-based classifier that it endeavors to focus on differences in 
location, rather than in scale. It shares this feature with most other distance- 
based classifiers, for example, the support vector machine and distance- 
weighted discrimination. However, for all these methods, differences in scale 
can confound differences in location to such an extent that the classifier can 
finish up assigning Z to whichever population has least variation, regardless 
of whether Z comes from $\Pi_X$ or $\Pi_Y$. 

One of the worst offenders in this regard is the standard nearest-neighbor 
method. If the populations $\Pi_X$ and $\Pi_Y$ have component-wise average variances 
equal to $\sigma_X^2$ and $\sigma_Y^2$, respectively, and component-wise average squared 
location differences equal to $\mu^2$, then the nearest-neighbor classifier gives 
asymptotically correct discrimination, as $p \to \infty$, if and only if 

(2.3) $\mu^2 > |\sigma_X^2 - \sigma_Y^2|$. 

If $\mu^2 < |\sigma_X^2 - \sigma_Y^2|$ then, with probability converging to 1 as $p \to \infty$, the 
nearest-neighbor method assigns Z to whichever of $\Pi_X$ and $\Pi_Y$ has least 
component-wise average variance, regardless of whether Z came from $\Pi_X$ 
or $\Pi_Y$. In contrast to (2.3), the support vector machine and centroid-based 
classifiers require only 

(2.4) $\mu^2 > |\sigma_X^2 m^{-1} - \sigma_Y^2 n^{-1}|$, 

where m and n denote the training-sample sizes for $\Pi_X$ and $\Pi_Y$, respectively. 
[These results hold in cases where p is very large relative to m and n, and 
under conditions discussed by Hall, Marron and Neeman (2005).] From (2.4) 
we see that, for support vector machine and centroid-based classifiers, the 
effects of increasing training-sample size can quickly reduce the impact of 
scale differences. However, in view of (2.3) this opportunity does not arise 
in the case of standard nearest-neighbor methods. In some problems the 
sample size issue is becoming less serious over time, as more data accumulate. 
However in other settings, for example, in the new uses of microarrays, the 
issue of small sample size can still be very important. 
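
The difference between conditions (2.3) and (2.4) is easy to check numerically. The sketch below simply evaluates the two inequalities; the function names and the example values are hypothetical, chosen only to show a case where the nearest-neighbor condition fails while the centroid and support vector machine condition holds.

```python
def nn_condition(mu_sq: float, var_x: float, var_y: float) -> bool:
    """Condition (2.3): average squared location difference exceeds the
    absolute difference of the average variances."""
    return mu_sq > abs(var_x - var_y)


def centroid_condition(mu_sq: float, var_x: float, var_y: float,
                       m: int, n: int) -> bool:
    """Condition (2.4): the variances are divided by the training-sample
    sizes m and n before being compared with the location difference."""
    return mu_sq > abs(var_x / m - var_y / n)


# Hypothetical values: mu^2 = 0.5, average variances 1 and 2, m = n = 5.
NN_OK = nn_condition(0.5, 1.0, 2.0)                    # 0.5 > 1 fails
CENTROID_OK = centroid_condition(0.5, 1.0, 2.0, 5, 5)  # 0.5 > 0.2 holds
```

With these assumed values a scale difference of 1 swamps the location signal for the nearest-neighbor method, while dividing by the sample sizes reduces the competing term to 0.2.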

Of course, if we felt that (2.3) or (2.4) correctly captured the ways in 
which location and scale worked together to jointly characterise populations 
$\Pi_X$ and $\Pi_Y$, then we would not introduce the scale adjustment suggested in 
Section 2.1. However, in practice one often feels that the differences between 
populations that are of interest are primarily those of location, not scale. 
For example, this tends to be the case with genomic data. 

The measures of performance discussed above address relatively subtle 
properties, where the "signal" that gives rise to location differences is at 
least bounded, if not small. By way of contrast, some related work on clas- 
sifier performance [see, e.g., Hall, Pittelkow and Ghosh (2007)] addresses 
instances where the signal, when it is present, is unboundedly large, and in 
fact diverges to infinity as p increases. In such cases a scale adjustment is 
not necessary since the effect of uncorrected scale is of smaller order than 
the impact of the signal. 

An alternative approach to scale adjustment is to empirically correct each 
component for scale before incorporating it in the classifier, in the manner 
of a t-statistic. If the scales of different components are genuinely different, 
for example, with some referring to weight and the others to distance, 
then standardisation is essential. Fortunately, in many of the applications 
to which classifiers are put the components have identical scales. For in- 
stance, in applications to genomic data the jth component of a data vector 
Xi or Yi typically represents the extent to which the jth gene is differentially 
expressed, or "switched on," and is on the same scale for each gene. 

In problems where scale standardization is necessary, for example, to 
accommodate heteroscedasticity among vector components, small sample 
sizes can lead to problems when dividing by standard deviation estimators. 
These difficulties can be alleviated by using a ridge parameter or a related 
approach to regularisation, for example, the band-matrix inversion method 
of Bickel and Levina (2008). 

3. Upper bound to classifier performance. 

3.1. Model for data. We use the following data model: 



(3.1) $X_{ik} = \delta a_k I_k + M_{ik}$, $Y_{jk} = \delta b_k J_k + N_{jk}$ and $Z_k = \delta c_k K_k + Q_k$, 

where (a) $M_i = (M_{i1}, M_{i2}, \ldots)$, $N_j = (N_{j1}, N_{j2}, \ldots)$ and $Q = (Q_1, Q_2, \ldots)$ 
are infinite sequences of random variables with finite, zero means, 
(b) $M_1, M_2, \ldots$ are independent and identically distributed, 
$N_1, N_2, \ldots$ are independent and identically distributed and 
the $M_i$'s, the $N_j$'s and Q are independent, (c) $a_1, a_2, \ldots$ and $b_1, b_2, \ldots$ 
are sequences of constants and $I_1, I_2, \ldots$ and $J_1, J_2, \ldots$ are sequences 
of zeros and ones, (d) $\delta > 0$ is a deterministic function of m, n and 
p, (e) $\min(m,n) \ge 2$ and (f) either $(c_k, K_k) = (a_k, I_k)$ for all k, or 
$(c_k, K_k) = (b_k, J_k)$ for all k. 

In particular, we make no assumptions about the relationships among the 
noise distributions for the X and Y populations. For example, we do not ask 
that the distributions of $M_1$, $N_1$ and Q be related in any sense. Condition 
(e) is needed so that we can estimate the scale of the data; variability generally 
cannot be accessed empirically if either m or n equals 1. However, (e) 
is unnecessary if $m^{-1}\tau_X^2 = n^{-1}\tau_Y^2$ and we use the classifier based on T(Z), 
rather than on $T_{sa}(Z)$. Condition (f) asserts that the pattern of the component 
means, $\delta c_k K_k$, for the new datum Z is identical to that for either 
the X or the Y data. In particular, we describe differences between the two 
populations only in terms of location differences. 

It might be thought that in the latter respect, the nonadjusted classifier 
based on T(Z) enjoys potential advantages since it is influenced by differences 
in scale as well as differences in location. However, the nonadjusted 
classifier can actually be seriously misled by scale differences. See, for example, 
Chan and Hall (2009). 

3.2. Main results. Define $\nu = \min(m,n)$. We assume that, for all $k \ge 1$, 
fourth moments of $M_{1k}$ and $N_{1k}$ exist, and second moments of $Q_k$ exist; 
and, more specifically, that the constants 

(3.2) $D_1 = \sup_{p \ge 1} \max\Big\{ \sup_{k_1 \ge 1} \sum_{k_2=1}^{p} |\mathrm{cov}(M_{1k_1}, M_{1k_2})|, \ \sup_{k_1 \ge 1} \sum_{k_2=1}^{p} |\mathrm{cov}(N_{1k_1}, N_{1k_2})|, \ \sup_{k_1 \ge 1} \sum_{k_2=1}^{p} |\mathrm{cov}(Q_{k_1}, Q_{k_2})| \Big\}$, 

(3.3) $D_2 = \sup_{p \ge 1} \max\Big\{ \sup_{k_1 \ge 1} \sum_{k_2=1}^{p} [\ldots], \ \sup_{k_1 \ge 1} \sum_{k_2=1}^{p} [\ldots] \Big\}$ 

are finite. Empirical evidence indicates that correlations among gene ex- 
pression levels are often quite low, for example, in the range 0.08 to 0.01 
at distances of between two and 10 base pairs, respectively [Mansilla et al. 
(2004), Messer and Arndt (2006)]. More generally, decay can occur at either 
an exponential or a reasonably fast polynomial rate [Almirantis and Provata 
(1999)]. 

This condition amounts to an assumption about the strength of dependence 
among the components of data vectors. To illustrate the implications 
of the condition we note that if the processes $\{M_{11}, \ldots, M_{1p}\}$, $\{N_{11}, \ldots, N_{1p}\}$ 
and $\{Q_1, \ldots, Q_p\}$ are all stationary and Gaussian, all with zero means and 
the same autocovariance function $\gamma(j) = \mathrm{cov}(Q_k, Q_{k+j})$, then finiteness of 
$D_1$ and $D_2$ is equivalent to convergence of the series $\sum_j |\gamma(j)|$. This is a mild 
assumption; the covariance can decay as slowly as $j^{-1-\eta}$, for any $\eta > 0$, and 
Theorem 1 will hold. 
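
As a worked illustration of this summability requirement, consider two autocovariance functions. Both are illustrative choices made here, not examples taken from the paper: one decays geometrically and the other polynomially.

```latex
% Illustrative autocovariances (assumptions): geometric and polynomial decay.
\[
\gamma(j) = \rho^{|j|},\ |\rho| < 1:
\qquad
\sum_{j=-\infty}^{\infty} |\gamma(j)|
  = 1 + 2\sum_{j=1}^{\infty} |\rho|^{j}
  = \frac{1+|\rho|}{1-|\rho|} < \infty ,
\]
\[
\gamma(j) = (1+|j|)^{-1-\eta},\ \eta > 0:
\qquad
\sum_{j=-\infty}^{\infty} |\gamma(j)|
  \le 1 + 2\int_{0}^{\infty} (1+x)^{-1-\eta}\,dx
  = 1 + \frac{2}{\eta} < \infty .
\]
```

If $\eta = 0$ in the second case the series is harmonic and diverges, which is why a strictly positive $\eta$ is needed.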

Define $d_k = a_k I_k - b_k J_k$, $d = (d_1, \ldots, d_p)$ and $\|d\|^2 = \sum_k d_k^2$. Let T(Z) and 
$T_{sa}(Z)$ be as at (2.1) and (2.2). In particular, T(Z) is the centroid-method 
classifier. A proof of the following theorem is given in a longer version of 
this paper [Hall and Pham (2009)]. 

Theorem 1. Assume the model at (3.1), and in particular suppose that 
(a)-(f) there hold. Then there exists a constant B > 0, depending only on 
$D_1$ and $D_2$ at (3.2) and (3.3), such that 

(3.4) $E\{T_{sa}(Z) - \delta^2 s(Z) \|d\|^2\}^2 \le B(\nu^{-1} p + \delta^2 \|d\|^2)$. 

Under the same assumptions, except that condition (e) $\min(m,n) \ge 2$ can 
now be dropped, we have instead of (3.4), 

(3.5) $E\{T(Z) - \delta^2 s(Z) \|d\|^2 + (m^{-1}\tau_X^2 - n^{-1}\tau_Y^2)\}^2 \le B(\nu^{-1} p + \delta^2 \|d\|^2)$. 

3.3. Implications for probability of correct classification. Assume for simplicity 
that $I_k = J_k$ for each k. (The latter condition implies that the "signal" 
is present at the same locations in the X and Y populations.) Suppose too 
that 

(3.6) $W_1 pq \le \|d\|^2 \le W_2 pq$, $m + n \le W_2 \min(m, n) = W_2 \nu$, 

where $0 < W_1 < W_2 < \infty$ are constants, and $q \in (0, 1]$ is an "index of sparsity." 
For example, if $I_k \ne 0$ for just pq values of k, and if the sum of 
$(a_k - b_k)^2/(pq)$ over these indices is bounded away from zero and infinity, then 
the first part of (3.6) holds and q denotes the proportion of components, in 
either the X or Y populations, where the signals have an opportunity to be 
nonzero. Of course, we permit q to vary with p as the latter increases. 

We also assume that $\nu \le Cp$ where C > 0 is a constant. Therefore, 
the number of dimensions is at least as large as a constant multiple 
of sample size. Without this condition, the results that we shall describe 
below are generally false. For example, they fail if p is held fixed as $\nu$ varies. 
In that setting it is readily shown that a classifier can detect alternatives 
distant $\nu^{-1/2}$, rather than $\nu^{-1/4}$, apart; the latter result would follow from 
the results given below if we were to take p and q fixed and permit $\nu$ to 
increase. These differences point to the intrinsic difficulty of undertaking 
classification using high-dimensional data in small samples, as distinct from 
low-dimensional data in large samples. 

Take $\delta = c(\nu p q^2)^{-1/4}$ where c > 0 denotes a fixed constant. Let $\mathcal{M} = 
\mathcal{M}(C, D_1, D_2, W_1, W_2)$ denote the set of all models prescribed by the constraint 
$\nu \le Cp$ [where it is assumed that (3.2), (3.3) and (3.6) hold for the 
constants $D_1$, $D_2$, $W_1$, $W_2$] and by conditions (a)-(f) in (3.1) [where we take 
$\delta = c(\nu p q^2)^{-1/4}$, for c > 0 fixed]. Then the following result holds [see Hall 
and Pham (2009) for a proof]: 

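Corollary 1. If (3.4) and (3.6) hold then 

(3.7) $\lim_{c \to \infty} \limsup_{p \to \infty} \sup_{\mathcal{M}} \{P(\text{the classifier } T_{sa} \text{ assigns } Z \text{ to } \Pi_X \mid Z \in \Pi_Y) + P(\text{the classifier } T_{sa} \text{ assigns } Z \text{ to } \Pi_Y \mid Z \in \Pi_X)\} = 0$. 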

That is, if the signals are distributed with sparsity q and are of size approximately 
$c(\nu p q^2)^{-1/4}$, then the probability that the classifier based on $T_{sa}$ 
makes the incorrect decision can be rendered arbitrarily close to 0 for all 
sufficiently large p and uniformly over all models in the class $\mathcal{M}$ by taking 
c sufficiently large. 
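
One heuristic way to see how a bound of the form (3.4) leads to a statement like (3.7) is through Chebyshev's inequality. The steps below are a sketch under the stated assumptions (3.4), (3.6) and $\nu \le Cp$, treating a tie $T_{sa}(Z) = 0$ as an error; they are not the proof given by Hall and Pham (2009).

```latex
% Heuristic sketch via Chebyshev's inequality; case Z from \Pi_X, so s(Z) = 1
% and an error requires T_{sa}(Z) \le 0.
\begin{align*}
P\{T_{sa}(Z) \le 0\}
  &\le P\{|T_{sa}(Z) - \delta^2\|d\|^2| \ge \delta^2\|d\|^2\}
   \le \frac{B(\nu^{-1}p + \delta^2\|d\|^2)}{\delta^4\|d\|^4}, \\
\delta^2\|d\|^2
  &\ge W_1\,pq\,c^2(\nu p q^2)^{-1/2} = W_1 c^2 (p/\nu)^{1/2}, \\
\frac{B(\nu^{-1}p + \delta^2\|d\|^2)}{\delta^4\|d\|^4}
  &\le \frac{B}{W_1^2 c^4} + \frac{B}{W_1 c^2 (p/\nu)^{1/2}}
   \le \frac{B}{W_1^2 c^4} + \frac{B\,C^{1/2}}{W_1 c^2}.
\end{align*}
```

The final bound does not depend on p and tends to zero as c increases; the case where Z comes from the Y population is symmetric.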

Results such as (3.4), (3.5) and (3.7) all have analogues in settings where 
the "constants" $a_k$ and $b_k$ are interpreted as random variables. See, for 
example, (4.5) in Section 4. 

Generally speaking, (3.7) fails if the scale adjustment suggested in Section 
2.1 is not incorporated, unless $\nu$ is at least as large as a constant multiple of p. 
Indeed, it can be shown that if $|m^{-1}\tau_X^2 - n^{-1}\tau_Y^2|$ is larger than a sufficiently 
large constant multiple of $\delta^2\|d\|^2$ (and this condition is often satisfied if 
$\nu \le \mathrm{const.}\, p$), then the probability of misclassification can be bounded away 
from zero as p diverges. These results point to the desirability of including 
the scale adjustment when defining the classifier. 
4. Lower bound to classifier performance. 

4.1. Data model for lower bound. Assume we observe 

(4.1) $X_{ik} = \delta A_k I_k + M_{ik}$, $Y_{jk} = \delta B_k J_k + N_{jk}$, $Z_k = \delta C_k K_k + Q_k$, 

where (i) $1 \le i \le m$ and $1 \le j \le n$; (ii) the random variables $A_k$, $B_k$, $M_{ik}$, 
$N_{jk}$ and $Q_k$ are normal N(0, 1); (iii) these variables, and $I_k$ and $J_k$, are 
totally independent for $1 \le i \le m$, $1 \le j \le n$ and $1 \le k \le p$; (iv) $I_k$ and 
$J_k$ are identically distributed, with $P(I_k = 0) = 1 - q$ and $P(I_k = 1) = q$; 
(v) $\delta > 0$ and $0 < q \le 1$; and (vi) either $(C_k, K_k) = (A_k, I_k)$ for all k, or 
$(C_k, K_k) = (B_k, J_k)$ for all k. It is desired to distinguish between the two 
cases in (vi) using only the data at (4.1). For example, determining that 
$(C_k, K_k) = (A_k, I_k)$ corresponds to classifying Z as coming from the X 
population. We permit m, n and q to depend on p, which we take to diverge 
to infinity. 

By permitting q to converge to zero as p diverges we can ensure a degree 
of sparsity in the signals. However, we do not insist that q becomes small 
as p increases; for example, our assumptions permit q to be held fixed, at 1, 
for all p. 
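
For readers who wish to experiment with the model at (4.1), the following sketch generates one training set and one new observation. It is an illustration under stated assumptions, not code from the paper: the function name, its parameters and the use of Python's standard random-number generator are choices made here.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def simulate_lower_bound_model(
    m: int, n: int, p: int, q: float, delta: float,
    z_from_x: bool, seed: int = 0,
) -> Tuple[Matrix, Matrix, List[float]]:
    """Draw (X, Y, Z) from the Gaussian signal-plus-noise model (4.1)."""
    rng = random.Random(seed)
    # Random signals A_k, B_k ~ N(0, 1), shared by all vectors in a sample.
    a = [rng.gauss(0.0, 1.0) for _ in range(p)]
    b = [rng.gauss(0.0, 1.0) for _ in range(p)]
    # Sparsity indicators I_k, J_k equal 1 with probability q.
    i_ind = [1 if rng.random() < q else 0 for _ in range(p)]
    j_ind = [1 if rng.random() < q else 0 for _ in range(p)]
    # Training data: signal plus independent N(0, 1) noise.
    xs = [[delta * a[k] * i_ind[k] + rng.gauss(0.0, 1.0) for k in range(p)]
          for _ in range(m)]
    ys = [[delta * b[k] * j_ind[k] + rng.gauss(0.0, 1.0) for k in range(p)]
          for _ in range(n)]
    # Condition (vi): Z shares its signal with either the X or the Y sample.
    c, k_ind = (a, i_ind) if z_from_x else (b, j_ind)
    z = [delta * c[k] * k_ind[k] + rng.gauss(0.0, 1.0) for k in range(p)]
    return xs, ys, z
```

Setting `delta = c * (m * p * q ** 2) ** -0.25` for a chosen constant c reproduces the scaling of the signal studied in Section 4.2.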

Provided the likelihood-ratio statistic is asymptotically normally distributed, 
that quantity provides asymptotically optimal discrimination between the 
cases $(C_k, K_k) = (A_k, I_k)$ and $(C_k, K_k) = (B_k, J_k)$ in (4.1). A necessary condition 
for asymptotic normality is 

(4.2) $\max(m + 1, n + 1)\,\delta^2 \le C$, 

where C > 0 is arbitrary but fixed. We shall make this assumption. 

To indicate the implications of (4.2) we note that when this condition 
holds, the bias and error-about-the-mean contributions to the likelihood-ratio 
statistic are of sizes $\omega = m p q^2 \delta^4$ and $\omega^{1/2}$, respectively. Therefore, 
if $\omega$ is small then the bias, which reveals the difference between the cases 
$(C_k, K_k) = (A_k, I_k)$ and $(C_k, K_k) = (B_k, J_k)$, is submerged in noise, and it 
is impossible, even when using the likelihood-ratio method, to distinguish 
effectively between the cases. On the other hand, if $\omega$ is large, then the cases 
can be distinguished with high probability. It is in the intermediate setting, 
where $\omega$ is not far from 1, that classification is marginal; see Theorem 2, 
below. In such instances, if it should be the case that $m/(pq^2)$ diverges 
along a subsequence, and if $\delta = c(m p q^2)^{-1/4}$ as in Theorem 1, then $m\delta^2$ 
must also diverge along that subsequence, contradicting (4.2). Therefore the 
context of our work implies that $m/(pq^2)$ is bounded, which in turn entails 
a lower bound to sparsity; for a constant C > 0, 

(4.3) $C(m/p)^{1/2} \le q \le 1$. 
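
The arithmetic behind this argument can be written out explicitly. The display below is a worked sketch that only substitutes $\delta = c(mpq^2)^{-1/4}$ into the quantities already named in the text, with C denoting the constant in (4.2).

```latex
% Substituting \delta = c (m p q^2)^{-1/4}; C is the constant in (4.2).
\begin{align*}
\omega &= m p q^2 \delta^4 = m p q^2 \cdot c^4 (m p q^2)^{-1} = c^4, \\
m\delta^2 &= c^2\, m\,(m p q^2)^{-1/2} = c^2 \{m/(p q^2)\}^{1/2}, \\
m\delta^2 \le C
  &\iff q \ge (c^2/C)\,(m/p)^{1/2}.
\end{align*}
```

Thus $\omega$ is of order 1 exactly when c is, and the requirement $m\delta^2 \le C$ implied by (4.2) gives a lower bound on q of the form (4.3), with constant $c^2/C$.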
4.2. Optimal convergence rates for the model at (4.1). Write $P_X$ and 
$P_Y$ for probability measure under (4.1) in the respective cases $C_k = A_k$ and 
$C_k = B_k$. Let $\chi$, a random quantity, be a measurable function of the data 
$X_{ik}$, $Y_{jk}$ and $Z_k$, for $1 \le i \le m$, $1 \le j \le n$ and $1 \le k \le p$, taking only 
the values A and B. In particular, $\chi$ can be 
interpreted as a classifier that ascribes Z to either $\Pi_X$ or $\Pi_Y$. Write $\mathcal{C}$ for 
the set of all such classifiers. 

The theorem below asserts that, unless $\delta$ is a relatively large constant 
multiple of $(m p q^2)^{-1/4}$, no classifier can effectively distinguish between the 
cases $(C_k, K_k) = (A_k, I_k)$ and $(C_k, K_k) = (B_k, J_k)$. Together with Theorem 
1 it shows that the scale-adjusted classifier introduced in Section 2.1 has an 
asymptotically optimal ability to distinguish between the two populations. 

Take $\delta$ in (4.1) to be given by $\delta = c(m p q^2)^{-1/4}$ where c > 0 is fixed. 

Theorem 2. Assume the model in Section 4.1, and in particular suppose 
that (i)-(vi) there hold. Suppose too that, as p diverges, the positive integers 
m and n, and $q \in (0,1]$, are such that (4.3) holds for a constant C > 0, 
and the ratio m/n is bounded away from zero and infinity. Then, for all 
sufficiently small c > 0, 

(4.4) $\liminf_{p \to \infty} \inf_{\chi \in \mathcal{C}} \{P_X(\chi = B) + P_Y(\chi = A)\} > 0$. 

The assumption that the "signals," represented by the terms $\delta A_k$ and 
$\delta B_k$ in (4.1), are random, gives them an irregular character and makes classification 
relatively challenging. If we take $A_k$ and $B_k$ to be fixed constants, 
not depending on k, then the classification problem is significantly simpler, 
and successful classification is possible for values of $\delta$ that are an order of 
magnitude smaller than those discussed in Theorem 2. In the model introduced 
at (3.1) we effectively conditioned on $A_k$ and $B_k$, treating them as 
constants $a_k$ and $b_k$. This is a minor alteration, however. In particular, (3.4) 
continues to hold if we give $a_k$ and $b_k$ the distributions of random variables, 
for example, as in point (ii) immediately below (4.1), and if we take expectations 
on both sides of (3.4). Arguing in this way the following analogue of 
(3.7) can be derived under the assumptions of Theorem 2. 

Theorem 3. Assume the conditions of Theorem 2, and in particular 
that $\delta$ in (4.1) is defined by $\delta = c(m p q^2)^{-1/4}$. Then 

(4.5) $\lim_{c \to \infty} \liminf_{p \to \infty} \min[P_X\{T_{sa}(Z) > 0\}, P_Y\{T_{sa}(Z) < 0\}] = 1$. 

Together, (4.4) and (4.5) establish optimality of the centroid-based clas- 
sifier. 
5. Numerical properties. An extensive simulation study is summarised 
by Hall and Pham (2009). It treats both moving-average and GARCH models 
[Fan and Yao (2003)] for the data vectors M, N and Q, and provides 
numerical evidence of theoretical properties reported in Sections 3 and 4. 
For example, it shows, as argued in theoretical terms in Corollary 1, that 
if the value of $\delta$, in the model at (3.1), is chosen so that a given, fixed percentage 
of classifications is correct, then $\delta$ changes with m in proportion to 
$m^{-1/4}$ if p (the dimension) and q (the level of sparsity) are kept fixed. 

Below we report the results of sampling experiments performed using the 
KDD 2008 dataset. The data are available at http://www.kddcup2008.com 
and contain information derived from X-ray images of breast cancer patients. 
Two supplementary files are also provided, Features.txt and Info.txt. 
The Features file contains information about 102,294 suspicious regions, 
each described by p = 117 features. The Info file provides additional in- 
formation about each region in the Features file. The latter file gives 11 
columns describing 11 characteristics of each region. For example, the first 
column contains labels that indicate whether the corresponding region was 
malignant or benign. To simplify the classification problem we used only 
information about this label (i.e., malignant or benign) of each region, and 
ignored other information in the Info file; we used the label information 
only to create the samples and to assess classifier performance. Our dataset 
therefore contained 623 data vectors corresponding to malignant regions, 
and 101,671 vectors from benign regions (623 + 101,671 = 102,294). 

We used the KDD data to compare five methods: scale-adjusted versions 
of the nearest-neighbor (NN), support vector machine (SVM) and centroid-based 
classifiers; the scaled variance (SV) classifier, for which the analogue 
of $T_{sa}$ was 

(5.1) $T_{SV}(Z) = (Z - \bar{Y})^T \hat\Sigma_Y^{-1} (Z - \bar{Y}) - (Z - \bar{X})^T \hat\Sigma_X^{-1} (Z - \bar{X})$; 

and the naive Bayes classifier. Definitions of the two first-mentioned classifiers 
are given by Chan and Hall (2009). The naive Bayes classifier was 
constructed under the assumption that all data were normally distributed 
and employed a ridge parameter. See the last paragraph of this section 
for details of the ridging method. In constructing the SV classifier we computed 
$\hat\Sigma_X$ and $\hat\Sigma_Y$, in (5.1), using the training data from $\Pi_X$ and $\Pi_Y$, respectively, 
and employing the band-matrix approach studied by Bickel and 
Levina (2008) with a single band on either side of the main diagonal. Using 
a single band was appropriate for the small training-sample sizes (3, 5, 8, 15 
and 20) encountered with the breast-cancer data. 

Training and test datasets were generated and used to assess the five 
classifiers, as follows. Throughout we took m = n. We randomly selected m data 
vectors from the 623 that represented malignant regions; we similarly chose 
n from the 101,671 that represented benign regions; we constructed the 
classifier from these data, and we applied it to the remaining 623 − m 
data from malignant regions and to a randomly chosen subset of 623 − n 
data from the remaining benign data. (Trialling the classifier against all the 
remaining benign data, i.e., the 101,671 − n benign data not used to 
construct the classifier, was too time consuming, so we reduced the number to 
623 − n, matching that for the malignant data.) This operation was repeated 
2000 times, and the error rates were averaged to produce the figures discussed 
below. Note that this procedure gave the two populations prior probabilities 
of 1/2 each, rather than the very disparate values of 623/102,294 = 0.006 and 
101,671/102,294 = 0.994 that would otherwise have prevailed. 
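
The resampling scheme just described can be sketched in Python as follows. This is only an outline under stated assumptions: `estimate_success_rates` and `centroid_predict` are hypothetical names, the plain (unadjusted) centroid rule is used as a stand-in for whichever classifier is being assessed, and the data are assumed to be held in NumPy arrays with one row per region.

```python
import numpy as np

def centroid_predict(z, x_train, y_train):
    """Plain centroid rule (a stand-in): True means 'assign to the X population'."""
    dist_x = np.sum((z - x_train.mean(axis=0)) ** 2, axis=1)
    dist_y = np.sum((z - y_train.mean(axis=0)) ** 2, axis=1)
    return dist_x < dist_y

def estimate_success_rates(malignant, benign, m, n_reps=2000, seed=0,
                           predict=centroid_predict):
    """Average success rates (malignant, benign) over repeated random splits, m = n."""
    rng = np.random.default_rng(seed)
    n_mal = len(malignant)
    rates = np.zeros((n_reps, 2))
    for rep in range(n_reps):
        mal_idx = rng.permutation(n_mal)
        # Draw as many benign vectors as there are malignant ones, so that
        # the benign test set has the same size, n_mal - m, as the malignant one.
        ben_idx = rng.choice(len(benign), size=n_mal, replace=False)
        x_train, x_test = malignant[mal_idx[:m]], malignant[mal_idx[m:]]
        y_train, y_test = benign[ben_idx[:m]], benign[ben_idx[m:]]
        rates[rep, 0] = np.mean(predict(x_test, x_train, y_train))
        rates[rep, 1] = np.mean(~predict(y_test, x_train, y_train))
    return rates.mean(axis=0)
```

Because both test sets have the same size, averaging the two returned rates corresponds to prior probabilities of 1/2 for each population.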

Next we summarise the main results. When the common value of m and 
n was between 15 and 20 the classifiers gave remarkably consistent results 
over all the settings we treated. In particular, when applied to data from the 
malignant region the success rate of each of the five classifiers (centroid, 
SVM, NN, SV and naive Bayes) was in the range 71% to 74%. The ranked 
order of the classifiers varied from one situation to another, but the centroid- 
based classifier was almost invariably ranked first. On the other hand (but 
still for m and n between 15 and 20), when applied to data from the benign 
region the five classifiers always separated into two clusters on the basis of 
performance. The centroid, SV and naive Bayes methods formed the 
highest-ranked cluster, each having a success rate between 72% and 83%; within 
this cluster the centroid method invariably outperformed the naive Bayes 
approach, and the SV method performed close to the centroid method. The 
other two classifiers performed noticeably worse, with success rates between 
51% and 58%. Among the latter two methods, either could outperform the 
other when applied to data from the benign region. 

At the extreme of relatively low sample size, and in particular when the 
common sample size was between 3 and 8, the performance of all classifiers 
deteriorated and the patterns noted above largely disappeared. For m and 
n between 5 and 8, and in applications to data from benign regions, the 
centroid, SV and naive Bayes techniques maintained their superiority over 
the other two, with the centroid-based method almost invariably the winner. 
However, in the case of smaller sample sizes the naive Bayes approach had 
the worst performance of all, in both the malignant and benign cases. Here, m 
and n were far too low for the assumption of normality, on which the naive 
Bayes method is based, to be even approximately valid. In the case of data 
from malignant regions the support vector machine also gave good results, 
being the second best performer behind the centroid method. 

Next we give a little more detail in specific cases, starting with the case 
where m = 20. When applied to classify data from malignant regions, the 
following ranking of classifiers in decreasing order of performance was found: 
centroid-based method, NN, naive Bayes, SV and SVM. When applied to 
classify data from benign regions, we found the following rank order: cen- 
troid, SV, naive Bayes, SVM and NN. The reasonably good performance of 
the naive Bayes classifier here was due partly to the fact that when m = 20, 
validity of the assumption of normality was aided by the central limit the- 
orem. In the case of the SV method the larger sample size helped when 
estimating the covariance matrix. The situation changed markedly when 
sample sizes were reduced to m = 5. There the SV and Bayes methods had 
significantly more difficulty estimating variance and covariance, to such an 
extent that using a ridge was essential to obtaining even mediocre perfor- 
mance. When m = 10 the Bayes classifier was inferior to each of the other 
four methods when the data were from malignant regions, and it ranked 
third, behind the centroid and SV methods, in the case of data from benign 
regions. 

We also explored in more detail the effect of using a ridge parameter to 
construct the naive Bayes classifier. The ridge was added to conventional es- 
timators of variance, and we sought values of the ridge in the interval [0.01, 1] 
that maximised classifier success rate, averaged over the malignant and be- 
nign cases and for the given choice of m. (To put the choice of interval into 
context we mention that the component-wise average empirical variances of 
the datasets, for benign and malignant regions, respectively, were 1.00 and 
1.21.) Our numerical experiments showed that, when m = 3 and the ridge 
was chosen optimally, the average success rate of the naive Bayes classifier 
increased from about 50% to 68%. However, when m = 5 the average suc- 
cess rate of the naive Bayes classifier increased by only 6%, and the amount 
of increase declined steadily as m increased; it was only 2% when m = 20. 
Of course, these results are the best possible ones when the ridge is chosen 
deterministically. In practice the ridge has to be selected empirically, and, 
especially when m is small (e.g., m = 3 or 5), empirical choice of ridge can 
actually lead to a deterioration in classification performance, since it adds 
extra noise to the classifier. 
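
To make the ridging step concrete, here is a minimal Python sketch of a normal-theory naive Bayes score in which a ridge is added to each component-wise variance estimate, together with a grid search over the interval [0.01, 1]. It is an illustration rather than the authors' implementation: the names `ridged_naive_bayes_score` and `best_ridge`, the 25-point grid and the use of equal prior probabilities are all assumptions.

```python
import numpy as np

def ridged_naive_bayes_score(z, x_train, y_train, ridge=0.1):
    """log f_X(z) - log f_Y(z) for independent normal components with ridged variances."""
    def log_density(train):
        mean = train.mean(axis=0)
        var = train.var(axis=0, ddof=1) + ridge   # ridge added to the usual estimator
        return -0.5 * np.sum(np.log(var) + (z - mean) ** 2 / var, axis=-1)
    return log_density(x_train) - log_density(y_train)

def best_ridge(x_train, y_train, x_test, y_test,
               grid=np.linspace(0.01, 1.0, 25)):
    """Ridge in [0.01, 1] maximising the success rate averaged over both populations."""
    def avg_success(ridge):
        ok_x = np.mean(ridged_naive_bayes_score(x_test, x_train, y_train, ridge) > 0)
        ok_y = np.mean(ridged_naive_bayes_score(y_test, x_train, y_train, ridge) < 0)
        return 0.5 * (ok_x + ok_y)
    return max(grid, key=avg_success)
```

Choosing the ridge with `best_ridge` on the test data mirrors the deterministic, best-possible choice discussed in the text; an empirical choice would have to use the training data only.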



6. Proof of Theorem 2. 



6.1. Likelihood when $(C_k, K_k) = (A_k, I_k)$. Let $\phi$ denote the standard 
normal density. The joint density of $X_{ik}$ (for $1 \le i \le m$), $Y_{jk}$ (for 
$1 \le j \le n$) and $Z_k$, for fixed $k$, equals 

(6.1)    $E\Big[\Big\{\prod_{i=1}^m \phi(x_{ik} - \delta A_k I_k)\Big\} \Big\{\prod_{j=1}^n \phi(y_{jk} - \delta B_k J_k)\Big\} \phi(z_k - \delta C_k K_k)\Big]$ 
         $= \Big\{\prod_{i=1}^m \phi(x_{ik})\Big\} \Big\{\prod_{j=1}^n \phi(y_{jk})\Big\} \phi(z_k) E(c_k),$ 

where 

         $c_k = \exp\Big\{-\tfrac12 \delta^2 (m A_k^2 I_k^2 + n B_k^2 J_k^2 + C_k^2 K_k^2)$ 
                $+ \delta \Big(A_k I_k \sum_{i=1}^m x_{ik} + B_k J_k \sum_{j=1}^n y_{jk} + C_k K_k z_k\Big)\Big\}.$ 

Put $\mathcal{X}_k = \{x_{1k}, \ldots, x_{mk}\}$, $\mathcal{Y}_k = \{y_{1k}, \ldots, y_{nk}\}$, $s_k = \sum_i x_{ik}$ and $t_k = \sum_j y_{jk}$. 
[Here we keep the data fixed, and so denote them by lower case letters, but 
from (6.4) down we shall give the data the joint distribution determined by 
(4.1), and from that point we shall use upper case letters.] If $(C_k, K_k) = 
(A_k, I_k)$, then 

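(6.2)    $E(c_k \mid \mathcal{X}_k, \mathcal{Y}_k, z_k)$ 
         $= E[\exp\{-\tfrac12 \delta^2 (m+1) A_k^2 I_k^2 + \delta A_k I_k (s_k + z_k)\} \mid \mathcal{X}_k, \mathcal{Y}_k, z_k]$ 
         $\times E\{\exp(-\tfrac12 \delta^2 n B_k^2 J_k^2 + \delta B_k J_k t_k) \mid \mathcal{X}_k, \mathcal{Y}_k, z_k\}.$ 

For $r, s > 0$ and real $t$, and with $N$ denoting a standard normal random variable, 

         $E\{\exp(-\tfrac12 r^2 s N^2 + r t N)\} = \exp\Big(\tfrac12 \frac{r^2 t^2}{r^2 s + 1}\Big) (r^2 s + 1)^{-1/2}.$ 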

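Hence, by (6.2), 

(6.3)    $\psi_1(\mathcal{X}_k, \mathcal{Y}_k, z_k) = E(c_k \mid \mathcal{X}_k, \mathcal{Y}_k, z_k)$ 
         $= \Big[1 - q + q \{(m+1)\delta^2 + 1\}^{-1/2} \exp\Big\{\tfrac12 \frac{\delta^2}{(m+1)\delta^2 + 1} (s_k + z_k)^2\Big\}\Big]$ 
         $\times \Big[1 - q + q (n\delta^2 + 1)^{-1/2} \exp\Big(\tfrac12 \frac{\delta^2}{n\delta^2 + 1} t_k^2\Big)\Big].$ 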
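
The Gaussian identity used to pass from (6.2) to (6.3) can be checked numerically. The Python sketch below compares a simple Riemann-sum approximation of the left-hand side with the closed form; the function names, the integration range and the grid size are arbitrary choices made for illustration.

```python
import numpy as np

def lhs_by_quadrature(r, s, t, half_width=12.0, n_grid=200001):
    """Approximate E exp(-r^2 s N^2 / 2 + r t N), N ~ N(0,1), by a Riemann sum."""
    x = np.linspace(-half_width, half_width, n_grid)
    phi = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
    integrand = np.exp(-0.5 * r**2 * s * x**2 + r * t * x) * phi
    return np.sum(integrand) * (x[1] - x[0])

def rhs_closed_form(r, s, t):
    """Closed form exp{r^2 t^2 / (2 (r^2 s + 1))} (r^2 s + 1)^(-1/2)."""
    a = r**2 * s + 1.0
    return np.exp(0.5 * r**2 * t**2 / a) / np.sqrt(a)
```

Taking r to be the signal strength, s = m + 1 and t = s_k + z_k gives the first factor of (6.3); taking s = n and t = t_k gives the second.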



Combining this result with (6.1) we conclude that the likelihood of 
$(\mathcal{X}_k, \mathcal{Y}_k, Z_k)$, under the assumption that $(C_k, K_k) = (A_k, I_k)$, is 

(6.4)    $\Big\{\prod_{i=1}^m \phi(X_{ik})\Big\} \Big\{\prod_{j=1}^n \phi(Y_{jk})\Big\} \phi(Z_k)\, \psi_1(\mathcal{X}_k, \mathcal{Y}_k, Z_k).$ 

6.2. Likelihood ratio. It follows from (6.4) that the ratio of the 
likelihoods of $(\mathcal{X}_k, \mathcal{Y}_k, Z_k)$, for $(C_k, K_k) = (A_k, I_k)$ versus 
$(C_k, K_k) = (B_k, J_k)$, is 

(6.5)    $\rho_k(\mathcal{X}_k, \mathcal{Y}_k, Z_k) = \frac{\psi_1(\mathcal{X}_k, \mathcal{Y}_k, Z_k)}{\psi_2(\mathcal{X}_k, \mathcal{Y}_k, Z_k)},$ 

where, by symmetry from (6.3), 

(6.6)    $\psi_2(\mathcal{X}_k, \mathcal{Y}_k, Z_k)$ 
         $= \Big[1 - q + q \{(n+1)\delta^2 + 1\}^{-1/2} \exp\Big\{\tfrac12 \frac{\delta^2}{(n+1)\delta^2 + 1} (T_k + Z_k)^2\Big\}\Big]$ 
         $\times \Big[1 - q + q (m\delta^2 + 1)^{-1/2} \exp\Big(\tfrac12 \frac{\delta^2}{m\delta^2 + 1} S_k^2\Big)\Big].$ 

The likelihood ratio for the full dataset, $\{(\mathcal{X}_k, \mathcal{Y}_k, Z_k) : 1 \le k \le p\}$, is given 
by 

(6.7)    $\rho = \prod_{k=1}^p \rho_k(\mathcal{X}_k, \mathcal{Y}_k, Z_k).$ 
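
For readers who want to experiment with the likelihood ratio, the following Python sketch evaluates the logarithm of (6.7) from the component-wise factors in (6.3) and (6.6). It is a hedged illustration, not the authors' code: the function names are hypothetical, `delta` and `q` stand for the signal strength and the sparsity probability and are treated as known, and the data are assumed to be NumPy arrays with one row per observation and one column per component.

```python
import numpy as np

def log_psi(sum_with_z, other_sum, n_with_z, n_other, delta, q):
    """log of the two-factor product in (6.3); (6.6) follows by swapping roles."""
    d2 = delta**2
    a = (n_with_z + 1) * d2 + 1.0   # variance factor for the sample that includes z
    b = n_other * d2 + 1.0          # variance factor for the other sample
    f1 = 1 - q + q * a**-0.5 * np.exp(0.5 * d2 / a * sum_with_z**2)
    f2 = 1 - q + q * b**-0.5 * np.exp(0.5 * d2 / b * other_sum**2)
    return np.log(f1) + np.log(f2)

def log_likelihood_ratio(x, y, z, delta, q):
    """log rho as in (6.5)-(6.7); x is m-by-p, y is n-by-p, z has length p."""
    m, n = x.shape[0], y.shape[0]
    s, t = x.sum(axis=0), y.sum(axis=0)
    log_psi1 = log_psi(s + z, t, m, n, delta, q)   # z grouped with the X sample
    log_psi2 = log_psi(t + z, s, n, m, delta, q)   # z grouped with the Y sample
    return np.sum(log_psi1 - log_psi2)
```

A likelihood-ratio rule would then assign z to the X population when `log_likelihood_ratio(x, y, z, delta, q)` is positive; the tie-breaking convention is an assumption of this sketch.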

6.3. Properties of $\rho$ when $(C_k, K_k) = (A_k, I_k)$. Assume that $(C_k, K_k) = 
(A_k, I_k)$ for all $k$. In this case, writing $N$ for a normal $N(0,1)$ random 
variable, and interpreting $S_k$, $T_k$ and $Z_k$ as random, we have the following: 

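         $E\Big[\exp\Big\{\tfrac12 \frac{\delta^2}{(m+1)\delta^2 + 1} (S_k + Z_k)^2\Big\}\Big]$ 
         $= (1 - q) E\Big[\exp\Big\{\tfrac12 \frac{(m+1)\delta^2}{(m+1)\delta^2 + 1} N^2\Big\}\Big]$ 
         $+ q E\Big[\exp\Big\{\tfrac12 \frac{\{m + 1 + (m+1)^2\delta^2\}\delta^2}{(m+1)\delta^2 + 1} N^2\Big\}\Big],$ 

         $E\Big[\exp\Big\{\tfrac12 \frac{\delta^2}{(n+1)\delta^2 + 1} (T_k + Z_k)^2\Big\}\Big]$ 
         $= (1 - q)^2 E\Big[\exp\Big\{\tfrac12 \frac{(n+1)\delta^2}{(n+1)\delta^2 + 1} N^2\Big\}\Big]$ 
         $+ q^2 E\Big[\exp\Big\{\tfrac12 \frac{\{n + 1 + (n^2+1)\delta^2\}\delta^2}{(n+1)\delta^2 + 1} N^2\Big\}\Big]$ 
         $+ q(1 - q) \Big( E\Big[\exp\Big\{\tfrac12 \frac{(n + 1 + n^2\delta^2)\delta^2}{(n+1)\delta^2 + 1} N^2\Big\}\Big] + E\Big[\exp\Big\{\tfrac12 \frac{(n + 1 + \delta^2)\delta^2}{(n+1)\delta^2 + 1} N^2\Big\}\Big] \Big),$ 

         $E\Big\{\exp\Big(\tfrac12 \frac{\delta^2}{m\delta^2 + 1} S_k^2\Big)\Big\}$ 
         $= (1 - q) E\Big\{\exp\Big(\tfrac12 \frac{m\delta^2}{m\delta^2 + 1} N^2\Big)\Big\} + q E\Big[\exp\Big\{\tfrac12 \frac{(m + m^2\delta^2)\delta^2}{m\delta^2 + 1} N^2\Big\}\Big],$ 

         $E\Big\{\exp\Big(\tfrac12 \frac{\delta^2}{n\delta^2 + 1} T_k^2\Big)\Big\}$ 
         $= (1 - q) E\Big\{\exp\Big(\tfrac12 \frac{n\delta^2}{n\delta^2 + 1} N^2\Big)\Big\} + q E\Big[\exp\Big\{\tfrac12 \frac{(n + n^2\delta^2)\delta^2}{n\delta^2 + 1} N^2\Big\}\Big].$ 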

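Note too that, for $c < 1$, $E\{\exp(\tfrac12 c N^2)\} = (1 - c)^{-1/2}$. Therefore, 

         $E\Big[\exp\Big\{\tfrac12 \frac{\delta^2}{(m+1)\delta^2 + 1} (S_k + Z_k)^2\Big\}\Big]$ 
         $= \{(m+1)\delta^2 + 1\}^{1/2} [1 - q + q \{1 - (m+1)^2\delta^4\}^{-1/2}],$ 

         $E\Big[\exp\Big\{\tfrac12 \frac{\delta^2}{(n+1)\delta^2 + 1} (T_k + Z_k)^2\Big\}\Big]$ 
         $= \{(n+1)\delta^2 + 1\}^{1/2} \big( (1 - q)^2 + q^2 \{1 - (n^2+1)\delta^4\}^{-1/2}$ 
         $+ q(1 - q) \{(1 - \delta^4)^{-1/2} + (1 - n^2\delta^4)^{-1/2}\} \big),$ 

         $E\Big\{\exp\Big(\tfrac12 \frac{\delta^2}{m\delta^2 + 1} S_k^2\Big)\Big\} = (m\delta^2 + 1)^{1/2} \{1 - q + q (1 - m^2\delta^4)^{-1/2}\},$ 

         $E\Big\{\exp\Big(\tfrac12 \frac{\delta^2}{n\delta^2 + 1} T_k^2\Big)\Big\} = (n\delta^2 + 1)^{1/2} \{1 - q + q (1 - n^2\delta^4)^{-1/2}\}.$ 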
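
The algebra behind the first of these closed forms can be verified directly. The Python sketch below evaluates the two-component mixture term by term, using the formula for the expectation of exp(cN²/2), and compares it with the simplified expression. The function names are hypothetical, and the check applies only when (m + 1)δ² < 1, so that both expectations are finite.

```python
import math

def e_exp_half_c_n2(c):
    """E exp(c N^2 / 2) for N ~ N(0,1); finite only when c < 1."""
    if c >= 1:
        raise ValueError("expectation is infinite for c >= 1")
    return (1.0 - c) ** -0.5

def moment_sz_mixture(m, delta, q):
    """Mixture form of E exp{delta^2 (S_k + Z_k)^2 / (2 ((m+1) delta^2 + 1))}."""
    d2 = delta**2
    a = (m + 1) * d2 + 1.0
    c_null = (m + 1) * d2 / a                        # component carries no signal
    c_signal = (m + 1 + (m + 1) ** 2 * d2) * d2 / a  # component carries signal
    return (1 - q) * e_exp_half_c_n2(c_null) + q * e_exp_half_c_n2(c_signal)

def moment_sz_closed(m, delta, q):
    """Simplified closed form of the same expectation."""
    d2 = delta**2
    a = (m + 1) * d2 + 1.0
    return math.sqrt(a) * (1 - q + q * (1 - (m + 1) ** 2 * d2**2) ** -0.5)
```

The simplification rests on the fact that the signal-case constant reduces to (m + 1)δ², so that its factor equals {1 − (m + 1)δ²}^(−1/2).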

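From these results, (6.3) and (6.6) we see that, if we define [here $(1 - E)V$ 
denotes $V - E(V)$] 

         $\Delta_{S,k} = (m\delta^2 + 1)^{-1/2} (1 - E) \exp\Big(\tfrac12 \frac{\delta^2}{m\delta^2 + 1} S_k^2\Big),$ 

         $\Delta_{T,k} = (n\delta^2 + 1)^{-1/2} (1 - E) \exp\Big(\tfrac12 \frac{\delta^2}{n\delta^2 + 1} T_k^2\Big),$ 

         $\Delta_{SZ,k} = \{(m+1)\delta^2 + 1\}^{-1/2} (1 - E) \exp\Big\{\tfrac12 \frac{\delta^2}{(m+1)\delta^2 + 1} (S_k + Z_k)^2\Big\},$ 

         $\Delta_{TZ,k} = \{(n+1)\delta^2 + 1\}^{-1/2} (1 - E) \exp\Big\{\tfrac12 \frac{\delta^2}{(n+1)\delta^2 + 1} (T_k + Z_k)^2\Big\},$ 

         $\mu_S = \Big[(m\delta^2 + 1)^{-1/2} E\Big\{\exp\Big(\tfrac12 \frac{\delta^2}{m\delta^2 + 1} S_k^2\Big)\Big\} - 1\Big] q^{-1} = (1 - m^2\delta^4)^{-1/2} - 1,$ 

         $\mu_T = \Big[(n\delta^2 + 1)^{-1/2} E\Big\{\exp\Big(\tfrac12 \frac{\delta^2}{n\delta^2 + 1} T_k^2\Big)\Big\} - 1\Big] q^{-1} = (1 - n^2\delta^4)^{-1/2} - 1,$ 

         $\mu_{SZ} = \Big( \{(m+1)\delta^2 + 1\}^{-1/2} E\Big[\exp\Big\{\tfrac12 \frac{\delta^2}{(m+1)\delta^2 + 1} (S_k + Z_k)^2\Big\}\Big] - 1 \Big) q^{-1}$ 
         $= \{1 - (m+1)^2\delta^4\}^{-1/2} - 1,$ 

         $\mu_{TZ} = \Big( \{(n+1)\delta^2 + 1\}^{-1/2} E\Big[\exp\Big\{\tfrac12 \frac{\delta^2}{(n+1)\delta^2 + 1} (T_k + Z_k)^2\Big\}\Big] - 1 \Big) q^{-1}$ 
         $= (1 - \delta^4)^{-1/2} + (1 - n^2\delta^4)^{-1/2} - 2$ 
         $+ q [1 + \{1 - (n^2+1)\delta^4\}^{-1/2} - (1 - \delta^4)^{-1/2} - (1 - n^2\delta^4)^{-1/2}],$ 

we have, 

         $\psi_1(\mathcal{X}_k, \mathcal{Y}_k, Z_k) = (1 + q^2\mu_{SZ} + q\Delta_{SZ,k})(1 + q^2\mu_T + q\Delta_{T,k}),$ 
         $\psi_2(\mathcal{X}_k, \mathcal{Y}_k, Z_k) = (1 + q^2\mu_{TZ} + q\Delta_{TZ,k})(1 + q^2\mu_S + q\Delta_{S,k}).$ 

Hence, by (6.5) and (6.7), 

         $\rho = \prod_{k=1}^p \frac{(1 + q^2\mu_{SZ} + q\Delta_{SZ,k})(1 + q^2\mu_T + q\Delta_{T,k})}{(1 + q^2\mu_{TZ} + q\Delta_{TZ,k})(1 + q^2\mu_S + q\Delta_{S,k})}.$ 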

6.4. Expansion of likelihood ratio. Throughout this section we impose 
the condition, given in Theorem 2, that $\delta = c(mpq^2)^{-1/4}$. The quantities 
$\mu_S$, $\mu_T$, $\mu_{SZ}$, $\mu_{TZ}$, var$(\Delta_S)$, var$(\Delta_T)$, var$(\Delta_{SZ})$ and var$(\Delta_{TZ})$, and their 
counterparts in the case where $(C_k, K_k) = (B_k, J_k)$, are all well defined and 
finite if and only if, for some $d \in (0, \tfrac12)$, 

(6.8)    $\max(m + 1, n + 1)\delta^2 \le d.$ 

This inequality follows from (4.3) and the assumption $\delta = c(mpq^2)^{-1/4}$, 
provided $c > 0$ is sufficiently small. In this setting we can write 

(6.9)    $\rho = \rho_{\rm bias}\, \rho_{\rm error},$ 

where 

(6.10)   $\rho_{\rm bias} = \Big\{\frac{(1 + q^2\mu_{SZ})(1 + q^2\mu_T)}{(1 + q^2\mu_{TZ})(1 + q^2\mu_S)}\Big\}^p,$ 

         $\rho_{\rm error} = \prod_{k=1}^p \frac{\{1 + q\Delta_{SZ,k}/(1 + q^2\mu_{SZ})\}\{1 + q\Delta_{T,k}/(1 + q^2\mu_T)\}}{\{1 + q\Delta_{TZ,k}/(1 + q^2\mu_{TZ})\}\{1 + q\Delta_{S,k}/(1 + q^2\mu_S)\}}$ 

denote, respectively, the dominant bias term and the dominant 
error-about-the-mean term in an expansion of the likelihood ratio $\rho$. We 
consider two cases: 

(i) The ratio m/n is bounded away from zero and infinity as $n \to \infty$. 

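(a) If $m\delta^2 \to 0$, then 

(6.11)   $1 + \mu_{SZ} = \{1 - (m+1)^2\delta^4\}^{-1/2}$ 
         $= (1 - m^2\delta^4)^{-1/2} \Big\{1 - \frac{(2m+1)\delta^4}{1 - m^2\delta^4}\Big\}^{-1/2}$ 
         $= (1 - m^2\delta^4)^{-1/2} \{1 + \tfrac12 (2m+1)\delta^4 + O(m^3\delta^8)\},$ 

         $\mu_{TZ} = \mu_T + (1 + \tfrac12\delta^4) - 1 + O(\delta^8)$ 
         $+ q \Big[1 - (1 - \delta^4)^{-1/2} + (1 - n^2\delta^4)^{-1/2} \Big\{\Big(1 - \frac{\delta^4}{1 - n^2\delta^4}\Big)^{-1/2} - 1\Big\}\Big],$ 

whence 

         $\frac{1 + q^2\mu_{SZ}}{1 + q^2\mu_S} = 1 + (m + \tfrac12) q^2\delta^4 + O(q^2 m^3\delta^8),$ 

         $\frac{1 + q^2\mu_T}{1 + q^2\mu_{TZ}} = 1 - \tfrac12 q^2\delta^4 + O(q^2 n^2\delta^8),$ 

(6.12)   $\rho_{\rm bias} = \Big(\frac{1 + q^2\mu_{SZ}}{1 + q^2\mu_{TZ}} \cdot \frac{1 + q^2\mu_T}{1 + q^2\mu_S}\Big)^p = \exp\{mpq^2\delta^4 + o(mpq^2\delta^4)\}.$ 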
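
The approximation in (6.12) can be examined numerically by evaluating the bias term exactly from the closed-form expressions for the four mean quantities in Section 6.3. The Python sketch below does this; the function name and the parameter values in the usage note are illustrative assumptions only.

```python
import math

def log_rho_bias(m, n, p, q, delta):
    """Exact log of the bias term in (6.10), from the closed forms of Section 6.3."""
    d4 = delta**4
    mu_s = (1 - m**2 * d4) ** -0.5 - 1
    mu_t = (1 - n**2 * d4) ** -0.5 - 1
    mu_sz = (1 - (m + 1) ** 2 * d4) ** -0.5 - 1
    a = (1 - d4) ** -0.5
    b = (1 - n**2 * d4) ** -0.5
    c = (1 - (n**2 + 1) * d4) ** -0.5
    mu_tz = a + b - 2 + q * (1 + c - a - b)
    q2 = q * q
    # log1p keeps precision when q^2 * mu is tiny and p is very large.
    return p * (math.log1p(q2 * mu_sz) + math.log1p(q2 * mu_t)
                - math.log1p(q2 * mu_tz) - math.log1p(q2 * mu_s))
```

For instance, with m = n = 10, p = 10**8, q = 0.01 and c = 1, so that `delta = (m * p * q * q) ** -0.25` and mδ² is about 0.03, the returned value is close to mpq²δ⁴ = 1, in line with (6.12).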

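To treat $\rho_{\rm error}$, note that 

         $\{(m+1)\delta^2 + 1\}^{-1/2} = (m\delta^2 + 1)^{-1/2} \{1 + O(\delta^4)\},$ 

         $\exp\Big\{\tfrac12 \frac{\delta^2}{(m+1)\delta^2 + 1} (S_k + Z_k)^2\Big\} = \exp\Big\{\tfrac12 \frac{\delta^2}{m\delta^2 + 1} (S_k + Z_k)^2\Big\} \times (1 + m\delta^4 R_k),$ 

where, here and below, $R_1, R_2, \ldots$ is a generic sequence of independent and 
identically distributed random variables, depending on $\delta$ but for which, for 
each $r \ge 1$, absolute moments of order $r$ are uniformly bounded provided $\delta$ 
is sufficiently small. Therefore, recalling the genericity of the notation $R_k$, 

         $\{(m+1)\delta^2 + 1\}^{-1/2} \exp\Big\{\tfrac12 \frac{\delta^2}{(m+1)\delta^2 + 1} (S_k + Z_k)^2\Big\}$ 
         $= (m\delta^2 + 1)^{-1/2} \exp\Big\{\tfrac12 \frac{\delta^2}{m\delta^2 + 1} (S_k + Z_k)^2\Big\} (1 + m\delta^4 R_k)$ 
         $= (m\delta^2 + 1)^{-1/2} \exp\Big(\tfrac12 \frac{\delta^2}{m\delta^2 + 1} S_k^2\Big)$ 
         $\times \Big\{1 + \tfrac12 \frac{\delta^2}{m\delta^2 + 1} (2 S_k Z_k + Z_k^2) + m\delta^4 R_k\Big\}.$ 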

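Hence, 

         $\Delta_{SZ,k} = \Delta_{S,k} + (1 - E)\Big\{(m\delta^2 + 1)^{-1/2} \exp\Big(\tfrac12 \frac{\delta^2}{m\delta^2 + 1} S_k^2\Big)$ 
         $\times \tfrac12 \frac{\delta^2}{m\delta^2 + 1} (2 S_k Z_k + Z_k^2)\Big\} + (1 - E) m\delta^4 R_k,$ 

and from (6.11), $\mu_{SZ} = \mu_S + O(m\delta^4)$. Therefore 

         $1 + \frac{q\Delta_{SZ,k}}{1 + q^2\mu_{SZ}} = 1 + \frac{q\Delta_{S,k}}{1 + q^2\mu_S}$ 
         $+ q(1 - E)\Big\{(m\delta^2 + 1)^{-1/2} \exp\Big(\tfrac12 \frac{\delta^2}{m\delta^2 + 1} S_k^2\Big)$ 
         $\times \tfrac12 \frac{\delta^2}{m\delta^2 + 1} (2 S_k Z_k + Z_k^2)\Big\} + (1 - E) m q \delta^4 R_k,$ 

whence, since $\Delta_{S,k} = (1 - E) m\delta^2 R_k$, 

(6.13)   $\frac{1 + q\Delta_{SZ,k}/(1 + q^2\mu_{SZ})}{1 + q\Delta_{S,k}/(1 + q^2\mu_S)} = 1 + U_k + (1 - E) m q \delta^4 R_k,$ 

where 

         $U_k = q(1 - E)\Big\{(m\delta^2 + 1)^{-1/2} \exp\Big(\tfrac12 \frac{\delta^2}{m\delta^2 + 1} S_k^2\Big) \tfrac12 \frac{\delta^2}{m\delta^2 + 1} (2 S_k Z_k + Z_k^2)\Big\}.$ 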

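Analogously, 

(6.14)   $\frac{1 + q\Delta_{TZ,k}/(1 + q^2\mu_{TZ})}{1 + q\Delta_{T,k}/(1 + q^2\mu_T)} = 1 + V_k + (1 - E) n q \delta^4 R_k,$ 

where 

         $V_k = q(1 - E)\Big\{(n\delta^2 + 1)^{-1/2} \exp\Big(\tfrac12 \frac{\delta^2}{n\delta^2 + 1} T_k^2\Big) \tfrac12 \frac{\delta^2}{n\delta^2 + 1} (2 T_k Z_k + Z_k^2)\Big\}.$ 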

Since the assumption that $m\delta^2$ is bounded entails $p^{1/2} m q \delta^4 = o(m^{1/2} p^{1/2} q \delta^2)$, 
(6.10), (6.13) and (6.14) imply that 

(6.15)   $\rho_{\rm error} = \prod_{k=1}^p \frac{\{1 + q\Delta_{SZ,k}/(1 + q^2\mu_{SZ})\}\{1 + q\Delta_{T,k}/(1 + q^2\mu_T)\}}{\{1 + q\Delta_{TZ,k}/(1 + q^2\mu_{TZ})\}\{1 + q\Delta_{S,k}/(1 + q^2\mu_S)\}}$ 
         $= \{1 + O_p(p^{1/2} m q \delta^4)\} \prod_{k=1}^p \frac{1 + U_k}{1 + V_k}$ 
         $= \exp\Big\{\sum_{k=1}^p (U_k - V_k) - \tfrac12 \sum_{k=1}^p (U_k^2 - V_k^2) + o_p(1)\Big\}.$ 

Now, $W = \sum_k (U_k - V_k)$ is asymptotically normal $N\{0, (m + n)pq^2\delta^4\}$, and 
$\sum_k (U_k^2 - V_k^2) = (m - n)pq^2\delta^4 + o_p(1)$. These properties and (6.15) imply 
that 

(6.16)   $\rho_{\rm error} = \exp\{W - \tfrac12 (m - n)pq^2\delta^4 + o_p(1)\}.$ 
Combining (6.9), (6.12) and (6.16) we deduce that 

(6.17)   $\rho = \exp[N \{(m + n)pq^2\delta^4\}^{1/2} + \tfrac12 (m + n)pq^2\delta^4 + o_p(1)],$ 

where $N$ is asymptotically normal $N(0, 1)$. Therefore, if $\chi$ is taken to be the 
likelihood-ratio classifier then, for all values of $c$ that are sufficiently 
small to ensure that (6.8) holds for some $d < \tfrac12$, 

(6.18)   $\liminf_{n \to \infty} \{P_X(\chi = B) + P_Y(\chi = A)\} > 0.$ 

This establishes Theorem 2 in the case where $m\delta^2 \to 0$. 
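
A small Python sketch may help to show why (6.17) yields (6.18). It rests on assumptions that go slightly beyond the text: the o_p(1) term is ignored, n/m is taken to converge to a fixed ratio, and the likelihood-ratio rule is assumed to choose A exactly when ρ > 1. Under the condition of Theorem 2, mpq²δ⁴ = c⁴, so the variance term in (6.17) equals c⁴(1 + n/m), and a vector from the X population is misclassified when N falls below minus one half of the corresponding standard deviation. The function names are hypothetical.

```python
import math

def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def limiting_error_under_x(c, ratio_n_over_m=1.0):
    """Limit of P_X(classify as B) implied by (6.17), ignoring the o_p(1) term."""
    sigma = c**2 * math.sqrt(1.0 + ratio_n_over_m)   # sigma^2 = (m + n) p q^2 delta^4
    # rho < 1  <=>  N * sigma + sigma^2 / 2 < 0  <=>  N < -sigma / 2
    return normal_cdf(-0.5 * sigma)
```

With c = 1 and m = n this gives a limiting error of about 0.24, and the value approaches 1/2 as c decreases to zero, so the error is bounded away from zero for every fixed small c.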

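(b) If $\ell_1 = m\delta^2$ and $\ell_2 = n\delta^2$ converge to finite, nonzero constants, both 
of them strictly less than 1, then 

         $1 + \mu_{SZ} = \{1 - (m+1)^2\delta^4\}^{-1/2}$ 
         $= (1 - \ell_1^2)^{-1/2} \Big\{1 - \frac{(2m+1)\delta^4}{1 - \ell_1^2}\Big\}^{-1/2}$ 
         $= 1 + \mu_S + (1 - \ell_1^2)^{-3/2} (m + \tfrac12)\delta^4 + O(m^2\delta^8),$ 

         $\mu_{TZ} = \mu_T + (1 + \tfrac12\delta^4) - 1 + O(\delta^8)$ 
         $+ q \Big[1 - (1 - \delta^4)^{-1/2} + (1 - \ell_2^2)^{-1/2} \Big\{\Big(1 - \frac{\delta^4}{1 - \ell_2^2}\Big)^{-1/2} - 1\Big\}\Big]$ 
         $= \mu_T + \tfrac12\delta^4 [1 + q \{(1 - \ell_2^2)^{-3/2} - 1\}] + O(\delta^8),$ 

whence 

         $\frac{1 + q^2\mu_{SZ}}{1 + q^2\mu_S} = 1 + q^2\delta^4 \frac{(1 - \ell_1^2)^{-3/2} (m + 1/2)}{1 + q^2 \{(1 - \ell_1^2)^{-1/2} - 1\}} + O(q^2 m^2\delta^8),$ 

         $\frac{1 + q^2\mu_T}{1 + q^2\mu_{TZ}} = 1 - \tfrac12 q^2\delta^4 \frac{1 + q \{(1 - \ell_2^2)^{-3/2} - 1\}}{1 + q^2 \{(1 - \ell_2^2)^{-1/2} - 1\}} + O(q^2\delta^8),$ 

         $\rho_{\rm bias} = \Big(\frac{1 + q^2\mu_{SZ}}{1 + q^2\mu_{TZ}} \cdot \frac{1 + q^2\mu_T}{1 + q^2\mu_S}\Big)^p = \exp\{L_1 mpq^2\delta^4 + o(1)\},$ 

where 

         $L_j = \frac{(1 - \ell_j^2)^{-3/2}}{1 + q^2 \{(1 - \ell_j^2)^{-1/2} - 1\}}, \quad j = 1, 2.$ 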

Compare (6.12). A similar argument can be used to derive an analogue of 
(6.15) in this setting, giving, via (6.9), the following analogue of (6.17): 

         $\rho = \exp[N \{(L_1 m + L_2 n)pq^2\delta^4\}^{1/2} + \tfrac12 (L_1 m + L_2 n)pq^2\delta^4 + o_p(1)],$ 

where $N$ is asymptotically normal $N(0, 1)$. Result (6.18) follows as before. 